
feat(mothership): implement auto-provisioning with manifest (#136)

Merged
deanq merged 77 commits into main from deanq/ae-1660-mothership-deploys-manifest
Jan 14, 2026
Conversation


deanq (Member) commented Jan 9, 2026

Summary

Implement automatic child endpoint provisioning when the Mothership (LoadBalancerSlsResource) boots up. The mothership reads the local manifest, reconciles with State Manager's persisted manifest, deploys/updates/deletes child resources accordingly, sets FLASH_MOTHERSHIP_URL on each child, and serves a /manifest endpoint for service discovery.

Key Features:

  • Background provisioning task (non-blocking) for fast cold starts
  • Intelligent reconciliation: deploy new, update changed, delete removed resources
  • Skips LoadBalancer resources during provisioning (avoids self-deployment)
  • Idempotent provisioning using config hashes to detect changes
  • State Manager integration for persistent manifest state across boots
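The reconciliation and config-hash ideas above can be sketched as follows. This is an illustrative outline, not the actual provisioner code; `config_hash` and `reconcile` are hypothetical names, and the real implementation additionally skips LoadBalancer resources to avoid self-deployment.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    # Stable hash of a resource config; hash equality means "unchanged".
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def reconcile(local: dict, persisted: dict) -> dict:
    # Diff the local manifest against the manifest persisted in State Manager.
    to_deploy = [name for name in local if name not in persisted]
    to_update = [
        name
        for name in local
        if name in persisted
        and config_hash(local[name]) != config_hash(persisted[name])
    ]
    to_delete = [name for name in persisted if name not in local]
    return {"deploy": to_deploy, "update": to_update, "delete": to_delete}
```

Because the diff is computed from config hashes rather than deploy history, re-running it against an unchanged manifest yields empty lists, which is what makes provisioning idempotent across boots.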

What's Included

Core Implementation

  • src/tetra_rp/runtime/mothership_provisioner.py - Main provisioning logic with manifest reconciliation
  • src/tetra_rp/runtime/state_manager_client.py - HTTP client for State Manager API
  • /manifest endpoint in LB handler for service discovery

Deployment Configuration

  • LoadBalancerSlsResource sets FLASH_IS_MOTHERSHIP=true env var
  • Mothership URL constructed from RUNPOD_ENDPOINT_ID
  • Manifest file (flash_manifest.json) loaded during boot

Comprehensive Tests

  • Unit tests for provisioner functions (reconciliation, URL construction, etc.)
  • Integration tests for end-to-end provisioning workflow
  • Resource drift detection tests for manifest reconciliation

Documentation

  • Updated docs/Cross_Endpoint_Routing.md with architecture details
  • Fixed terminology inconsistencies (Directory → Manifest)

Bug Fixes

  • Fixed critical endpoint bug in manifest_client.py (was querying /directory, now queries /manifest)
  • Updated exception references throughout codebase

Testing

  • All tests passing
  • Quality checks: format, lint, type checking all passing

Related Issues

  • AE-1660: Mothership auto-provisioning implementation

deanq added 30 commits January 3, 2026 01:22
Implement a factory function that creates RunPod serverless handlers,
eliminating code duplication across generated handler files.

The generic_handler module provides:
- create_handler(function_registry) factory that accepts a dict of
  function/class objects and returns a RunPod-compatible handler
- Automatic serialization/deserialization using cloudpickle + base64
- Support for both function execution and class instantiation + method calls
- Structured error responses with full tracebacks for debugging
- Load manifest for cross-endpoint function discovery

This design centralizes all handler logic in one place, making it easy to:
- Fix bugs once, benefit all handlers
- Add new features without regenerating projects
- Keep deployment packages small (handler files are ~23 lines each)

Implementation:
- deserialize_arguments(): Base64 + cloudpickle decoding
- serialize_result(): Cloudpickle + base64 encoding
- execute_function(): Handles function vs. class execution
- load_manifest(): Loads flash_manifest.json for service discovery
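The serialization helpers named above follow a simple pickle-then-base64 shape. The sketch below uses the stdlib `pickle` as a dependency-free stand-in for `cloudpickle` (which the module actually uses, since it also handles lambdas and closures); the payload field names are illustrative.

```python
import base64
import pickle  # stand-in for cloudpickle, which also handles lambdas/closures


def serialize_result(result) -> str:
    # Pickle the value, then base64-encode so it survives JSON transport.
    return base64.b64encode(pickle.dumps(result)).decode("utf-8")


def deserialize_arguments(payload: dict) -> tuple:
    # Reverse the encoding for args/kwargs; tolerate missing fields.
    def decode(field, default):
        raw = payload.get(field)
        return pickle.loads(base64.b64decode(raw)) if raw else default

    return decode("args", []), decode("kwargs", {})
```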
…uild process

Implement the build pipeline components that work together to generate
serverless handlers from @Remote decorated functions.

Three core components:

1. RemoteDecoratorScanner (scanner.py)
   - Uses Python AST to discover all @Remote decorated functions
   - Extracts function metadata: name, module, async status, is_class
   - Groups functions by resource_config for handler generation
   - Handles edge cases like decorated classes and async functions

2. ManifestBuilder (manifest.py)
   - Groups functions by their resource_config
   - Creates flash_manifest.json structure for service discovery
   - Maps functions to their modules and handler files
   - Enables cross-endpoint function routing at runtime

3. HandlerGenerator (handler_generator.py)
   - Creates lightweight handler_*.py files for each resource config
   - Each handler imports functions and registers them in FUNCTION_REGISTRY
   - Handler delegates to create_handler() factory from generic_handler
   - Generated handlers are ~23 lines (vs ~98 with duplication)

Build Pipeline Flow:
1. Scanner discovers @Remote functions
2. ManifestBuilder groups them by resource_config
3. HandlerGenerator creates handler_*.py for each group
4. All files + manifest bundled into archive.tar.gz

This eliminates ~95% duplication across handlers by using the factory pattern
instead of template-based generation.
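The grouping step in that pipeline boils down to a dictionary keyed by resource config, with one generated handler file per group. The function and field names below illustrate the shape only, not the actual ManifestBuilder/HandlerGenerator APIs.

```python
from collections import defaultdict


def group_by_resource(functions: list) -> dict:
    # One handler file is generated per resource_config group.
    groups = defaultdict(list)
    for fn in functions:
        groups[fn["resource_config"]].append(fn)
    return dict(groups)


def handler_filename(resource_name: str) -> str:
    # Matches the handler_<resource_name>.py naming convention described above.
    return f"handler_{resource_name}.py"
```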
Implement 19 unit tests covering all major paths through the generic_handler
factory and its helper functions.

Test Coverage:

Serialization/Deserialization (7 tests):
- serialize_result() with simple values, dicts, lists
- deserialize_arguments() with empty, args-only, kwargs-only, mixed inputs
- Round-trip encoding/decoding of cloudpickle + base64

Function Execution (4 tests):
- Simple function execution with positional and keyword arguments
- Keyword argument handling
- Class instantiation and method calls
- Argument passing to instance methods

Handler Factory (8 tests):
- create_handler() returns callable RunPod handler
- Handler with simple function registry
- Missing function error handling (returns error response, not exception)
- Function exceptions caught with traceback included
- Multiple functions in single registry
- Complex Python objects (classes, lambdas, closures)
- Empty registry edge case
- Default execution_type parameter
- None return values
- Correct RunPod response format (success, result/error, traceback)

Test Strategy:
- Arrange-Act-Assert pattern for clarity
- Isolated unit tests (no external dependencies)
- Tests verify behavior, not implementation
- Error cases tested for proper error handling
- All serialization tested for round-trip correctness

All tests passing, 83% coverage on generic_handler.py
…canning

Implement integration tests validating the build pipeline components work
correctly together.

Test Coverage:

HandlerGenerator Tests:
- Handler files created with correct names (handler_<resource_name>.py)
- Generated files import required functions from workers
- FUNCTION_REGISTRY properly formatted
- create_handler() imported from generic_handler
- Handler creation via factory
- RunPod start call present and correct
- Multiple handlers generated for multiple resource configs

ManifestBuilder Tests:
- Manifest structure with correct version and metadata
- Resources grouped by resource_config
- Handler file paths correct
- Function metadata preserved (name, module, is_async, is_class)
- Function registry mapping complete

Scanner Tests:
- @Remote decorated functions discovered via AST
- Function metadata extracted correctly
- Module paths resolved properly
- Async functions detected
- Class methods detected
- Edge cases handled (multiple decorators, nested classes)

Test Strategy:
- Integration tests verify components work together
- Tests verify generated files are syntactically correct
- Tests validate data structures match expected schemas
- No external dependencies in build process

Validates that the entire build pipeline:
1. Discovers functions correctly
2. Groups them appropriately
3. Generates valid Python handler files
4. Creates correct manifest structure
Add comprehensive architecture documentation explaining why the factory
pattern was chosen and how it works.

Documentation includes:

Overview & Context:
- Problem statement: Handler files had 95% duplication
- Design decision: Use factory function instead of templates
- Benefits: Single source of truth, easier maintenance, consistency

Architecture Diagrams (MermaidJS):
- High-level flow: @Remote functions → Scanner → Manifest → Handlers → Factory
- Component relationships: HandlerGenerator, GeneratedHandler, generic_handler
- Function registry pattern: Discovery → Grouping → Registration → Factory

Implementation Details:
- create_handler(function_registry) signature and behavior
- deserialize_arguments(): Base64 + cloudpickle decoding
- serialize_result(): Cloudpickle + base64 encoding
- execute_function(): Function vs. class execution
- load_manifest(): Service discovery via flash_manifest.json

Design Decisions (with rationale):
- Factory Pattern over Inheritance: Simpler, less coupling, easier to test
- CloudPickle + Base64: Handles arbitrary objects, safe JSON transmission
- Manifest in Generic Handler: Runtime service discovery requirement
- Structured Error Responses: Debugging aid, functional error handling
- Both Execution Types: Supports stateful classes and pure functions

Usage Examples:
- Simple function handler
- Class execution with methods
- Multiple functions in one handler

Build Process Integration:
- 4-phase pipeline: Scanner → Grouping → Generation → Packaging
- Manifest structure and contents
- Generated handler structure (~23 lines)

Testing Strategy:
- 19 unit tests covering all major paths
- 7 integration tests verifying handler generation
- Manual testing with example applications

Performance:
- Zero runtime penalty (factory called once at startup)
- No additional indirection in request path
Document the flash build command and update CLI README to include it.

New Documentation:

flash-build.md includes:

Usage & Options:
- Command syntax: flash build [OPTIONS]
- --no-deps: Skip transitive dependencies (faster, smaller archives)
- --keep-build: Keep build directory for inspection/debugging
- --output, -o: Custom archive name (default: archive.tar.gz)

What It Does (5-step process):
1. Discovery: Scan for @Remote decorated functions
2. Grouping: Group functions by resource_config
3. Handler Generation: Create lightweight handler files
4. Manifest Creation: Generate flash_manifest.json
5. Packaging: Create archive.tar.gz for deployment

Build Artifacts:
- .flash/archive.tar.gz: Deployment package (ready for RunPod)
- .flash/flash_manifest.json: Service discovery configuration
- .flash/.build/: Temporary build directory

Handler Generation:
- Explains factory pattern and minimal handler files
- Links to Runtime_Generic_Handler.md for details

Dependency Management:
- Default behavior: Install all dependencies including transitive
- --no-deps: Only direct dependencies (when base image has transitive)
- Trade-offs explained

Cross-Endpoint Function Calls:
- Example showing GPU and CPU endpoints
- Manifest enables routing automatically

Output & Troubleshooting:
- Sample build output with progress indicators
- Common failure scenarios and solutions
- How to debug with --keep-build

Next Steps:
- Test locally with flash run
- Deploy to RunPod
- Monitor with flash undeploy list

Updated CLI README.md:
- Added flash build to command list in sequence
- Links to full flash-build.md documentation
Add a new section explaining how the build system works and why the
factory pattern reduces code duplication.

New Section: Build Process and Handler Generation

Explains:

How Flash Builds Your Application (5-step pipeline):
1. Discovery: Scans code for @Remote decorated functions
2. Grouping: Groups functions by resource_config
3. Handler Generation: Creates lightweight handler files
4. Manifest Creation: Generates flash_manifest.json for service discovery
5. Packaging: Bundles everything into archive.tar.gz

Handler Architecture (with code example):
- Shows generated handler using factory pattern
- Single source of truth: All handler logic in one place
- Easier maintenance: Bug fixes don't require rebuilding projects

Cross-Endpoint Function Calls:
- Example of GPU and CPU endpoints calling each other
- Manifest and runtime wrapper handle service discovery

Build Artifacts:
- .flash/.build/: Temporary build directory
- .flash/archive.tar.gz: Deployment package
- .flash/flash_manifest.json: Service configuration

Links to detailed documentation:
- docs/Runtime_Generic_Handler.md for architecture details
- src/tetra_rp/cli/docs/flash-build.md for CLI reference

This section bridges the main README and the detailed documentation,
providing an entry point for new users discovering the build system.
Wire up the handler generator, manifest builder, and scanner into the
actual flash build command implementation.

Changes to build.py:

1. Integration:
   - Import RemoteDecoratorScanner for function discovery
   - Import ManifestBuilder for manifest creation
   - Import HandlerGenerator for handler file creation
   - Call these in sequence during the build process

2. Build Pipeline:
   - After copying project files, scan for @Remote functions
   - Build manifest from discovered functions
   - Generate handler files for each resource config
   - Write manifest to build directory
   - Progress indicators show what's being generated

3. Fixes:
   - Change .tetra directory references to .flash
   - Uncomment actual build logic (was showing "Coming Soon" message)
   - Fix progress messages to show actual file counts

4. Error Handling:
   - Try/except around handler generation
   - Warning shown if generation fails but build continues
   - User can debug with --keep-build flag

Build Flow Now:
1. Load ignore patterns
2. Collect project files
3. Create build directory
4. Copy files to build directory
5. [NEW] Scan for @Remote functions
6. [NEW] Build and write manifest
7. [NEW] Generate handler files
8. Install dependencies
9. Create archive
10. Clean up build directory (unless --keep-build)

Dependencies:
- Updated uv.lock with all required dependencies
…handling

**Critical Fixes:**
- Remove "Coming Soon" message blocking build command execution
- Fix build directory to use .flash/.build/ directly (no app_name subdirectory)
- Fix tarball to extract with flat structure using arcname="."
- Fix cleanup to remove correct build directory

**Error Handling & Validation:**
- Add specific exception handling (ImportError, SyntaxError, ValueError)
- Add import validation to generated handlers
- Add duplicate function name detection across resources
- Add proper error logging throughout build process

**Resource Type Tracking:**
- Add resource_type field to RemoteFunctionMetadata
- Track actual resource types (LiveServerless, CpuLiveServerless)
- Use actual types in manifest instead of hardcoding

**Robustness Improvements:**
- Add handler import validation post-generation
- Add manifest path fallback search (cwd, module dir, legacy location)
- Add resource name sanitization for safe filenames
- Add specific exception logging in scanner (UnicodeDecodeError, SyntaxError)

**User Experience:**
- Add troubleshooting section to README
- Update manifest path documentation in docs
- Change "Zero Runtime Penalty" to "Minimal Runtime Overhead"
- Mark future enhancements as "Not Yet Implemented"
- Improve build success message with next steps

Fixes all 20 issues identified in code review (issues #1-13, #19-22)
Implement LoadBalancerSlsResource class for provisioning RunPod load-balanced
serverless endpoints. Load-balanced endpoints expose HTTP servers directly to
clients without queue-based processing, enabling REST APIs, webhooks, and
real-time communication patterns.

Key features:
- Type enforcement (always LB, never QB)
- Scaler validation (REQUEST_COUNT required, not QUEUE_DELAY)
- Health check polling via /ping endpoint (200/204 = healthy)
- Post-deployment verification with configurable retries
- Async and sync health check methods
- Comprehensive unit tests
- Full documentation with architecture diagrams and examples

Architecture:
- Extends ServerlessResource with LB-specific behavior
- Validates configuration before deployment
- Polls /ping endpoint until healthy (10 retries × 5s = 50s timeout)
- Raises TimeoutError if endpoint fails to become healthy

This forms the foundation for Mothership architecture where a load-balanced
endpoint serves as a directory server for child endpoints.
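The health-check loop described above (10 retries × 5s = 50s) can be sketched as below. `ping` is injected as any callable returning an HTTP status code, so this is a shape sketch rather than the actual resource method, which polls the endpoint's /ping URL.

```python
import time


def wait_until_healthy(ping, retries: int = 10, delay: float = 5.0) -> None:
    # Poll until /ping reports healthy (200/204); give up after retries * delay seconds.
    for _ in range(retries):
        if ping() in (200, 204):
            return
        time.sleep(delay)
    raise TimeoutError("endpoint failed to become healthy")
```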
Import ServerlessResource directly and use patch.object on the imported class
instead of string-based patches. This ensures the mocks properly intercept the
parent class's _do_deploy method when called via super(). Simplifies mock
configuration and removes an unused variable assertion.

Fixes the three failing deployment tests that were making real GraphQL API calls.
All tests now pass: 418 passed, 1 skipped.
…oints

Implement core infrastructure for enabling @Remote decorator on
LoadBalancerSlsResource endpoints with HTTP method/path routing.

Changes:
- Create LoadBalancerSlsStub: HTTP-based stub for direct endpoint execution
  (src/tetra_rp/stubs/load_balancer_sls.py, 170 lines)
  - Serializes functions and arguments using cloudpickle + base64
  - Direct HTTP POST to /execute endpoint (no queue polling)
  - Proper error handling and deserialization

- Register stub with singledispatch (src/tetra_rp/stubs/registry.py)
  - Enables @Remote to dispatch to LoadBalancerSlsStub for LB resources

- Extend @Remote decorator with HTTP routing parameters (src/tetra_rp/client.py)
  - Add 'method' parameter: GET, POST, PUT, DELETE, PATCH
  - Add 'path' parameter: /api/endpoint routes
  - Validate method/path required for LoadBalancerSlsResource
  - Store routing metadata on decorated functions/classes
  - Warn if routing params used with non-LB resources

Foundation for Phase 2 (Build system integration) and Phase 3 (Local dev).
Update RemoteDecoratorScanner to extract HTTP method and path from
@Remote decorator for LoadBalancerSlsResource endpoints.

Changes:
- Add http_method and http_path fields to RemoteFunctionMetadata
- Add _extract_http_routing() method to parse decorator keywords
- Extract method (GET, POST, PUT, DELETE, PATCH) from decorator
- Extract path (/api/process) from decorator
- Store routing metadata for manifest generation

Foundation for Phase 2.2 (Manifest updates) and Phase 2.3 (Handler generation).
Enhance ManifestBuilder to support HTTP method/path routing for
LoadBalancerSlsResource endpoints.

Changes:
- Add http_method and http_path fields to ManifestFunction
- Validate LB endpoints have both method and path
- Detect and prevent route conflicts (same method + path)
- Prevent use of reserved paths (/execute, /ping)
- Add 'routes' section to manifest for LB endpoints
- Conditional inclusion of routing fields (only for LB)

Manifest structure for LB endpoints now includes:
{
  "resources": {
    "api_service": {
      "resource_type": "LoadBalancerSlsResource",
      "functions": [
        {
          "name": "process_data",
          "http_method": "POST",
          "http_path": "/api/process"
        }
      ]
    }
  },
  "routes": {
    "api_service": {
      "POST /api/process": "process_data"
    }
  }
}
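The validation rules above (reserved paths, duplicate method+path pairs) reduce to a small check performed while building the routes map shown in the manifest. Names below are illustrative, not the actual ManifestBuilder internals.

```python
RESERVED_PATHS = {"/execute", "/ping"}


def build_routes(functions: list) -> dict:
    # Produce the "METHOD /path" -> function-name map, rejecting invalid routes.
    routes = {}
    for fn in functions:
        if fn["http_path"] in RESERVED_PATHS:
            raise ValueError(f"{fn['http_path']} is a reserved path")
        key = f"{fn['http_method']} {fn['http_path']}"
        if key in routes:
            raise ValueError(f"route conflict: {key}")
        routes[key] = fn["name"]
    return routes
```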
Implement LBHandlerGenerator to create FastAPI applications for
LoadBalancerSlsResource endpoints with HTTP method/path routing.

Key features:
- Generates FastAPI apps with explicit route registry
- Creates (method, path) -> function mappings from manifest
- Validates route conflicts and reserved paths
- Imports user functions and creates dynamic routes
- Includes required /ping health check endpoint
- Validates generated handler Python syntax via import

Generated handler structure enables:
- Direct HTTP routing to user functions via FastAPI
- Framework /execute endpoint for @Remote stub execution
- Local development with uvicorn
Create create_lb_handler() factory function that dynamically builds FastAPI
applications from route registries for LoadBalancerSlsResource endpoints.

Key features:
- Accepts route_registry: Dict[(method, path)] -> handler_function mapping
- Registers all user-defined routes from registry to FastAPI app
- Provides /execute endpoint for @Remote stub function execution
- Handles async function execution automatically
- Serializes results with cloudpickle + base64 encoding
- Comprehensive error handling with detailed logging

The /execute endpoint enables:
- Remote function code execution via @Remote decorator
- Automatic argument deserialization from cloudpickle/base64
- Result serialization for transmission back to client
- Support for both sync and async functions
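The factory's core idea, a `(method, path) -> function` registry, can be shown without FastAPI. The real `create_lb_handler()` registers these routes on a FastAPI app; the dependency-free sketch below (with the hypothetical name `create_dispatcher`) only demonstrates the lookup-and-call shape, including the automatic handling of async handlers.

```python
import asyncio
import inspect


def create_dispatcher(route_registry: dict):
    # route_registry maps (METHOD, path) tuples to handler callables.
    def dispatch(method: str, path: str, payload: dict) -> dict:
        handler = route_registry.get((method.upper(), path))
        if handler is None:
            return {"status": 404, "error": f"no route for {method} {path}"}
        result = handler(**payload)
        if inspect.iscoroutine(result):
            # Async handlers are awaited transparently.
            result = asyncio.run(result)
        return {"status": 200, "result": result}

    return dispatch
```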
Update build command to use appropriate handler generators based on
resource type. Separates LoadBalancerSlsResource endpoints (using FastAPI)
from queue-based endpoints (using generic handler).

Changes:
- Import LBHandlerGenerator alongside HandlerGenerator
- Inspect manifest resources and separate by type
- Generate LB handlers via LBHandlerGenerator
- Generate QB handlers via HandlerGenerator
- Combine all generated handler paths for summary

Enables users to mix LB and QB endpoints in same project with correct
code generation for each resource type.
Implement LiveLoadBalancer resource following the LiveServerless pattern
for local development and testing of load-balanced endpoints.

Changes:
- Add TETRA_LB_IMAGE constant for load-balanced Tetra image
- Create LiveLoadBalancer class extending LoadBalancerSlsResource
- Uses LiveServerlessMixin to lock imageName to Tetra LB image
- Register LiveLoadBalancer with LoadBalancerSlsStub in singledispatch
- Export LiveLoadBalancer from core.resources and top-level __init__

This enables users to test LB-based functions locally before deploying,
using the same pattern as LiveServerless for queue-based endpoints.

Users can now write:
  from tetra_rp import LiveLoadBalancer, remote

  api = LiveLoadBalancer(name="test-api")

  @remote(api, method="POST", path="/api/process")
  async def process_data(x, y):
      return {"result": x + y}

  result = await process_data(5, 3)  # Local execution
Implement unit tests for LoadBalancerSlsStub covering:
- Request preparation with arguments and dependencies
- Response handling for success and error cases
- Error handling for invalid responses
- Base64 encoding/decoding of serialized data
- Endpoint URL validation
- Timeout and HTTP error handling

Test coverage:
- _prepare_request: 4 tests
- _handle_response: 5 tests
- _execute_function: 3 error case tests
- __call__: 2 integration tests

Tests verify proper function serialization, argument handling,
error propagation, and response deserialization.
Fix test_load_balancer_vs_queue_based_endpoints by updating the @Remote
decorator to use method='POST' and path='/api/echo' to match the test
assertions. This was a test-level bug where the decorator definition
didn't match what was being asserted.
…ndpoints

- Using_Remote_With_LoadBalancer.md: User guide for HTTP routing, local development, building and deploying
- LoadBalancer_Runtime_Architecture.md: Technical details on deployment, request flows, security, and performance
- Updated README.md with LoadBalancer section and code example
- Updated Load_Balancer_Endpoints.md with cross-references to new guides
Split @Remote execution behavior between local and deployed:
- LiveLoadBalancer (local): Uses /execute endpoint for function serialization
- LoadBalancerSlsResource (deployed): Uses user-defined routes with HTTP param mapping

Changes:
1. LoadBalancerSlsStub routing detection:
   - _should_use_execute_endpoint() determines execution path
   - _execute_via_user_route() maps args to JSON and POSTs to user routes
   - Auto-detects resource type and routing metadata

2. Conditional /execute registration:
   - create_lb_handler() now accepts include_execute parameter
   - Generated handlers default to include_execute=False (security)
   - LiveLoadBalancer can enable /execute if needed

3. Updated handler generator:
   - Added clarity comments on /execute exclusion for deployed endpoints

4. Comprehensive test coverage:
   - 8 new tests for routing detection and execution paths
   - All 31 tests passing (22 unit + 9 integration)

5. Documentation updates:
   - Using_Remote_With_LoadBalancer.md: clarified /execute scope
   - Added 'Local vs Deployed Execution' section explaining differences
   - LoadBalancer_Runtime_Architecture.md: updated execution model
   - Added troubleshooting for deployed endpoint scenarios

Security improvement:
- Deployed endpoints only expose user-defined routes
- /execute endpoint removed from production (prevents arbitrary code execution)
- Lower attack surface for deployed endpoints
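The routing split above hinges on detecting which resource flavor is in play. A minimal stand-in for that detection (the real `_should_use_execute_endpoint()` presumably inspects more metadata than just the class name) could look like:

```python
def should_use_execute_endpoint(resource) -> bool:
    # Local "Live" LB resources ship code per call via /execute;
    # deployed LoadBalancerSlsResource exposes only user-defined routes.
    return type(resource).__name__ in {"LiveLoadBalancer", "CpuLiveLoadBalancer"}
```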
…lude /execute endpoint

- Modified manifest.py to validate LiveLoadBalancer endpoints like LoadBalancerSlsResource
- Updated lb_handler_generator to:
  - Include LiveLoadBalancer in handler generation filter
  - Pass include_execute=True for LiveLoadBalancer (local dev)
  - Pass include_execute=False for LoadBalancerSlsResource (deployed)
- Added integration tests:
  - Verify LiveLoadBalancer handlers include /execute endpoint
  - Verify deployed handlers exclude /execute endpoint
- Fixes critical bug: LiveLoadBalancer now gets /execute endpoint in generated handlers
…ss resources

- Updated scanner to extract LiveLoadBalancer and LoadBalancerSlsResource resources
- Previously only looked for 'Serverless' in class name, missing LoadBalancer endpoints
- Now checks for both 'Serverless' and 'LoadBalancer' in resource type names
- Added integration test to verify scanner discovers both resource types
- Fixes critical bug that prevented flash build from finding LoadBalancer endpoints
- Wrap long lines in manifest.py, lb_handler.py, and load_balancer_sls.py
- Remove unused httpx import in test_load_balancer_sls_stub.py
- Apply consistent formatting across codebase
- Scanner: Use exact type name matching instead of substring matching
  - Whitelist specific resource types to avoid false positives
  - Prevents matching classes like 'MyServerlessHelper' or 'LoadBalancerUtils'

- Type hints: Use Optional[str] for nullable fields in manifest
  - ManifestFunction.http_method and http_path now properly typed

- Timeout: Make HTTP client timeout configurable
  - Added LoadBalancerSlsStub.DEFAULT_TIMEOUT class attribute
  - Added timeout parameter to __init__
  - Updated both _execute_function and _execute_via_user_route to use self.timeout

- Deprecated datetime: Replace datetime.utcnow() with datetime.now(timezone.utc)
  - Updated manifest.py and test_lb_remote_execution.py
  - Ensures Python 3.12+ compatibility
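The datetime change is a one-liner but worth spelling out, since the two calls differ in more than deprecation status:

```python
from datetime import datetime, timezone

# datetime.utcnow() is deprecated as of Python 3.12 and returns a *naive* datetime;
# datetime.now(timezone.utc) returns a timezone-aware one.
stamp = datetime.now(timezone.utc)
iso = stamp.isoformat()  # carries the "+00:00" offset explicitly
```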
The set_serverless_template model_validator was being overwritten by sync_input_fields
(both had mode="after"). In Pydantic v2, when two validators with the same mode are
defined in a class, only one is registered.

This caused templates to never be created from imageName, resulting in:
  "GraphQL errors: One of templateId, template is required to create an endpoint"

Solution:
- Move set_serverless_template validator from ServerlessResource base class to subclasses
  (ServerlessEndpoint and LoadBalancerSlsResource) where the validation is actually needed
- Keep helper methods (_create_new_template, _configure_existing_template) in base class
  for reuse
- Add comprehensive tests for LiveLoadBalancer template serialization

This allows:
1. Base ServerlessResource to be instantiated freely for testing/configuration
2. Subclasses (ServerlessEndpoint, LoadBalancerSlsResource) to enforce template
   requirements during deployment
3. Proper template serialization in GraphQL payload for RunPod API

Fixes: One of templateId, template is required to create an endpoint error when
deploying LiveLoadBalancer with custom image tags like runpod/tetra-rp-lb:local
- Fix: Use correct endpoint URL format for load-balanced endpoints
  (https://{id}.api.runpod.ai instead of https://api.runpod.ai/v2/{id})
  This fixes 404 errors on /ping health check endpoints

- Feature: Add CPU LoadBalancer support
  * Create CpuLoadBalancerSlsResource for CPU-based load-balanced endpoints
  * Create CpuLiveLoadBalancer for local CPU LB development
  * Add TETRA_CPU_LB_IMAGE constant for CPU LB Docker image
  * Update example code to use CpuLiveLoadBalancer for CPU worker
  * Add 8 comprehensive tests for CPU LoadBalancer functionality

- Tests: Add 2 tests for endpoint URL format validation
- All 474 tests passing, 64% code coverage
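The URL fix above is just a different template per endpoint flavor; both constructions are shown side by side for contrast (helper names are illustrative):

```python
def lb_endpoint_url(endpoint_id: str) -> str:
    # Load-balanced endpoints: subdomain-style host, serves /ping directly.
    return f"https://{endpoint_id}.api.runpod.ai"


def qb_endpoint_url(endpoint_id: str) -> str:
    # Queue-based endpoints: versioned path on the shared API host.
    return f"https://api.runpod.ai/v2/{endpoint_id}"
```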
…etra_rp package

LoadBalancer resources were not being discovered by ResourceDiscovery because
the new CPU variants (CpuLiveLoadBalancer, CpuLoadBalancerSlsResource) were
not exported from the main tetra_rp package. This prevented undeploy from
picking up these resources.

Added exports to:
- TYPE_CHECKING imports for type hints
- __getattr__ function for lazy loading
- __all__ list for public API

This fixes the issue where 'flash undeploy list' could not find LoadBalancer
resources that were deployed with 'flash run --auto-provision'.
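The lazy-loading export pattern mentioned here (a module-level `__getattr__`, PEP 562) can be demonstrated on a synthetic module. The mapping target below is a deliberate stand-in, not the real tetra_rp submodule path:

```python
import importlib
import types

# Map public names to "module.attr" targets, resolved on first access (PEP 562).
_LAZY_EXPORTS = {"CpuLiveLoadBalancer": "collections.OrderedDict"}  # stand-in target

mod = types.ModuleType("tetra_rp_sketch")


def _module_getattr(name: str):
    try:
        module_path, attr = _LAZY_EXPORTS[name].rsplit(".", 1)
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_path), attr)


mod.__getattr__ = _module_getattr  # module-level __getattr__ hook
mod.__all__ = list(_LAZY_EXPORTS)
```

In the real package, the `__getattr__` lives in `__init__.py` and imports from the resources submodule, so importing `tetra_rp` stays cheap while `__all__` still advertises the full public API to discovery tools.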
deanq added 2 commits January 9, 2026 02:05
Update documentation to consistently use 'Manifest' instead of 'Directory':
- Replace DirectoryClient references with StateManagerClient (actual implementation)
- Update architecture diagram to reference /manifest endpoint instead of DirectoryClient
- Fix ServiceRegistry code examples to use /manifest endpoint
- Update extension point for custom directory backends
- Fix testing section to reference actual test files (MothershipProvisioner, StateManagerClient)
- Update debugging section with /manifest endpoint examples
- Clarify that directory is loaded from mothership /manifest endpoint

These changes ensure documentation matches the actual AE-1660 implementation.
Critical fix: Update ManifestClient to query /manifest endpoint instead of /directory

Changes:
- Fix ManifestClient.get_directory() to query /manifest endpoint (not /directory)
- Update ManifestClient docstring: 'manifest directory service' → '/manifest endpoint'
- Fix DirectoryUnavailableError → ManifestServiceUnavailableError in docs
- Update example URLs from 'api.runpod.io' to actual LB endpoint format
- Clarify in docstrings that this queries the mothership's /manifest endpoint

This bug would have caused runtime failures when querying the mothership directory,
as the actual endpoint served by lb_handler_generator.py is /manifest, not /directory.
deanq changed the base branch from main to deanq/ae-1196-absolute-drift-detection January 9, 2026 10:18
Base automatically changed from deanq/ae-1196-absolute-drift-detection to main January 12, 2026 04:12
deanq requested a review from Copilot January 12, 2026 10:26

Copilot AI left a comment


Pull request overview

This PR implements automatic child endpoint provisioning for the Mothership (LoadBalancerSlsResource) using manifest reconciliation. The mothership reads the local manifest file on boot, reconciles it with a persisted manifest from State Manager, and automatically deploys, updates, or deletes child resources to match the desired state.

Changes:

  • Added mothership auto-provisioning system with intelligent reconciliation logic
  • Implemented State Manager client for persistent manifest state tracking
  • Added /manifest endpoint for service discovery
  • Fixed endpoint bug in manifest_client.py (corrected /directory to /manifest)

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Changes by file:

  • tests/unit/runtime/test_mothership_provisioner.py — Comprehensive unit tests covering provisioner functions including URL construction, manifest loading, hash computation, reconciliation logic, and directory extraction
  • tests/integration/test_mothership_provisioning.py — Integration tests for end-to-end provisioning workflows including first boot, changes detection, resource removal, error handling, and idempotency
  • tests/integration/test_lb_remote_execution.py — Updated test assertions to include new lifespan parameter in handler creation
  • src/tetra_rp/runtime/state_manager_client.py — New HTTP client for State Manager API with methods for fetching/updating/removing persisted manifest state
  • src/tetra_rp/runtime/mothership_provisioner.py — Core provisioning logic implementing manifest reconciliation, resource deployment/update/deletion, and directory mapping
  • src/tetra_rp/runtime/manifest_client.py — Fixed endpoint from /directory to /manifest with updated documentation
  • src/tetra_rp/runtime/lb_handler.py — Added lifespan parameter to handler creation for startup/shutdown hooks
  • src/tetra_rp/core/resources/load_balancer_sls_resource.py — Sets FLASH_IS_MOTHERSHIP=true env var during deployment
  • src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py — Added lifespan context manager with mothership provisioning logic and /manifest endpoint
  • docs/Cross_Endpoint_Routing.md — Updated documentation with mothership auto-provisioning architecture and terminology corrections


Comment thread src/tetra_rp/runtime/mothership_provisioner.py Outdated
Comment thread src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py Outdated
deanq and others added 9 commits January 12, 2026 13:30
Changes FLASH_MOTHERSHIP_URL to FLASH_MOTHERSHIP_ID for cleaner
environment configuration. Child endpoints now use FLASH_RESOURCE_NAME
to identify which resource config they represent in the manifest.

Changes:
- ManifestClient: Construct URL from FLASH_MOTHERSHIP_ID instead of full URL
- ServiceRegistry: Use FLASH_RESOURCE_NAME with fallback to RUNPOD_ENDPOINT_ID
- Add tomli dependency for Python <3.11 pyproject.toml parsing (needed for build.py)

Benefits:
- Simpler environment configuration (ID instead of full URL)
- Clear distinction between mothership (RUNPOD_ENDPOINT_ID) and children (FLASH_RESOURCE_NAME)
- Consistent URL construction pattern

Files modified:
- src/tetra_rp/runtime/manifest_client.py
- src/tetra_rp/runtime/service_registry.py
- pyproject.toml
- uv.lock
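
A minimal sketch of the ID-based URL construction described above (the domain pattern is an assumption here; the authoritative version is in `manifest_client.py`):

```python
import os

# Assumed domain pattern for load-balanced serverless endpoints.
ENDPOINT_DOMAIN = "api.runpod.ai"

def mothership_url(env=None) -> str:
    """Build the mothership base URL from FLASH_MOTHERSHIP_ID.

    Children carry only the ID in their environment; the full URL is
    derived on demand, keeping configuration to a single short value.
    """
    env = env if env is not None else os.environ
    mothership_id = env.get("FLASH_MOTHERSHIP_ID")
    if not mothership_id:
        raise RuntimeError("FLASH_MOTHERSHIP_ID is not set")
    return f"https://{mothership_id}.{ENDPOINT_DOMAIN}"
```

The mothership itself derives its own URL the same way from RUNPOD_ENDPOINT_ID, which keeps the construction pattern consistent across both roles.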
Removes LoadBalancer resource filtering to enable multi-tier
architectures. Adds cache validation to prevent stale resources
from being deployed after codebase refactoring.

Provisioning Changes:
- Remove LoadBalancer filtering in reconcile_manifests()
- Support CpuLiveLoadBalancer, LiveLoadBalancer, LoadBalancerSlsResource
- Add filter_resources_by_manifest() to validate cached resources against manifest
- Add test-mothership mode with "tmp-" prefix for temporary test endpoints
- Change env vars: FLASH_MOTHERSHIP_URL -> FLASH_MOTHERSHIP_ID

Resource Manager Changes:
- Track all created resources (deployed = has ID) regardless of health status
- Cache resources even if deployment completes with errors
- Ensures cleanup capability for all created resources

Cache Validation:
- Prevents stale resources from old codebase versions being redeployed
- Validates: resource name exists in manifest + type matches
- Logs removed stale entries for visibility

Benefits:
- Multi-tier load balancing architectures now supported
- No orphaned resources from refactored code
- Better resource lifecycle management
- Reliable cleanup of all created resources

Files modified:
- src/tetra_rp/runtime/mothership_provisioner.py
- src/tetra_rp/core/resources/resource_manager.py
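
The cache-validation rule above (name must exist in the manifest and the type must match) can be sketched like this; names are illustrative, not the exact implementation:

```python
def filter_resources_by_manifest(cached: dict, manifest: dict) -> dict:
    """Drop cached resources that no longer match the manifest.

    `cached` and `manifest` both map resource name -> metadata dict with a
    "type" key. An entry survives only if its name still exists in the
    manifest AND its type matches; everything else is a stale leftover
    from an earlier codebase version.
    """
    kept = {}
    for name, entry in cached.items():
        expected = manifest.get(name)
        if expected is None or expected.get("type") != entry.get("type"):
            # Logged for visibility so removed stale entries are traceable.
            print(f"Removing stale cache entry: {name}")
            continue
        kept[name] = entry
    return kept
```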
…ements

Enables bundling local tetra_rp source into builds for development and
testing. Updates LB handler to serve authoritative manifest from State Manager.

Build System Changes:
- Add _find_local_tetra_rp() to detect development installations
- Add _bundle_local_tetra_rp() to copy source into build directory
- Add _extract_tetra_rp_dependencies() to parse pyproject.toml for deps
- Add _remove_tetra_from_requirements() to clean up after bundling
- Skip bundling for PyPI installations (site-packages)

LB Handler Changes:
- Store StateManagerClient in module-level state for /manifest endpoint
- Update /manifest endpoint to fetch from State Manager (single source of truth)
- Add proper error handling for uninitialized state client
- Restrict /manifest endpoint to mothership only (403 for children)
- Improve provisioning startup logging for clarity

Benefits:
- Test-mothership can use local tetra_rp changes without publishing
- Manifest endpoint serves complete authoritative state
- Child endpoints get consistent configuration from single source
- Better development workflow for framework changes

Files modified:
- src/tetra_rp/cli/commands/build.py
- src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py
Adds --force flag to undeploy for non-interactive cleanup (needed by
test-mothership). Improves resource discovery visibility with debug logging.

Undeploy Changes:
- Add --force/-f flag to skip confirmation prompts
- Update _undeploy_by_name(), _undeploy_all(), _interactive_undeploy() to support skip_confirm
- Enables automated cleanup in CI/CD and test-mothership shutdown

Discovery Changes:
- Add detailed logging at each discovery phase (entry point, static imports, directory scan)
- Log discovered resource names and types for debugging
- Exclude .flash/ directory from project scanning (build artifacts)

Run Command Changes:
- Add resource discovery debug output showing found resources
- Display resource names and types before server startup

CLI Main Changes:
- Register test-mothership command (note: implementation was in commit 1)

Benefits:
- Test-mothership can cleanup automatically without user interaction
- Better visibility into resource discovery process
- Easier debugging of discovery issues
- Clean separation of interactive vs automated workflows

Files modified:
- src/tetra_rp/cli/commands/undeploy.py
- src/tetra_rp/cli/commands/run.py
- src/tetra_rp/core/discovery.py
- src/tetra_rp/cli/main.py
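
The `--force` flag above separates interactive and automated paths. A sketch with argparse (the actual CLI framework may differ):

```python
import argparse

def build_undeploy_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="flash undeploy")
    parser.add_argument("name", nargs="?", help="resource to undeploy (all if omitted)")
    parser.add_argument(
        "--force", "-f", action="store_true",
        help="skip confirmation prompts (for CI/CD and test-mothership shutdown)",
    )
    return parser

def undeploy(args, confirm=input) -> bool:
    """Tear down resources, prompting first unless --force was given."""
    if not args.force:
        answer = confirm(f"Undeploy {args.name or 'all resources'}? [y/N] ")
        if answer.strip().lower() != "y":
            return False
    # ... perform the actual teardown here ...
    return True
```

Passing `confirm` as a parameter keeps the prompt testable; `--force` simply bypasses it entirely.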
Updates all tests to reflect LoadBalancer provisioning, FLASH_RESOURCE_NAME
usage, and removal of obsolete test cases.

Mothership Provisioner Tests:
- Update tests to expect LoadBalancer resources in provisioning (not skipped)
- Fix create_resource_from_manifest tests to use RUNPOD_ENDPOINT_ID env var
- Update UnsupportedResourceType test (LoadBalancer now supported)
- Remove obsolete get_manifest_directory() tests (function removed)

Service Registry Tests:
- Update all tests to use FLASH_RESOURCE_NAME instead of RUNPOD_ENDPOINT_ID
- Add test for FLASH_RESOURCE_NAME priority with RUNPOD_ENDPOINT_ID fallback
- Update test names to reflect new behavior

Integration Tests:
- Update test_provision_children_skips_load_balancer_resources to
  test_provision_children_deploys_load_balancer_resources
- Fix assertions to expect 2 deployments (LoadBalancer + worker)
- Remove obsolete test_manifest_directory_endpoint_after_provisioning

Manifest Client Tests:
- Update initialization tests for FLASH_MOTHERSHIP_ID usage
- Update error message expectations

Test Rationale:
- LoadBalancer provisioning enables multi-tier architectures
- FLASH_RESOURCE_NAME provides clearer child endpoint identification
- Removed tests for deleted functionality (get_manifest_directory)

Files modified:
- tests/unit/runtime/test_mothership_provisioner.py
- tests/unit/runtime/test_service_registry.py
- tests/integration/test_mothership_provisioning.py
- tests/unit/runtime/test_manifest_client.py
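
The FLASH_RESOURCE_NAME priority with RUNPOD_ENDPOINT_ID fallback that these tests exercise amounts to a one-liner; a sketch of the lookup (function name is illustrative):

```python
import os

def resolve_resource_name(env=None):
    """Identify this endpoint within the manifest.

    Children are labeled with FLASH_RESOURCE_NAME; the mothership (or an
    unlabeled endpoint) falls back to RUNPOD_ENDPOINT_ID.
    """
    env = env if env is not None else os.environ
    return env.get("FLASH_RESOURCE_NAME") or env.get("RUNPOD_ENDPOINT_ID")
```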
@deanq deanq changed the title feat(mothership): implement auto-provisioning with manifest reconciliation feat(mothership): implement auto-provisioning with manifest Jan 13, 2026
deanq added 2 commits January 14, 2026 00:38
…irectories

Changes:
- Modified LBHandlerGenerator to use importlib pattern instead of from imports
- Aligns LB handlers with QB handler pattern for consistency
- Fixes SyntaxError when building projects with numeric directory names (e.g., 03_advanced_workers)
- Added boolean flags (is_load_balanced, is_live_resource) to replace string comparisons
- Added test coverage for numeric module paths

The bug occurred because Python identifiers cannot start with digits, but
importlib treats module paths as strings, allowing any valid filesystem path.
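
The digit-prefix behavior is easy to demonstrate: `import 03_advanced_workers` is a SyntaxError, but `importlib.import_module("03_advanced_workers")` is fine because module names are just strings to the import machinery. A minimal self-contained illustration:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a module whose name starts with a digit; such a name can never
# appear in an import statement, but importlib loads it without complaint.
tmp = Path(tempfile.mkdtemp())
(tmp / "03_demo.py").write_text("VALUE = 42\n")
sys.path.insert(0, str(tmp))

mod = importlib.import_module("03_demo")
print(mod.VALUE)  # -> 42
```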
Changes:
- Scanner now tracks config variable names (e.g., "gpu_config") at scan time
- Manifest includes config_variable field for each resource and function
- test-mothership uses config_variable from manifest for reliable discovery
- Added backward compatibility fallback to old search logic

Fixes "No config variable found" warnings when resource names differ from
variable names (e.g., resource "03_05_load_balancer_gpu" with variable "gpu_config").

This enables test-mothership to correctly discover and provision all resources
including load balancer endpoints, resolving health check failures.
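
The config_variable lookup with its backward-compatible fallback might look like the sketch below (illustrative names; `find_config_object` is not the actual function):

```python
def find_config_object(module_globals: dict, entry: dict):
    """Locate a resource's config object from a manifest entry.

    Prefer the config_variable recorded at scan time (e.g. "gpu_config"),
    since the variable name may differ from the resource name
    (e.g. resource "03_05_load_balancer_gpu"). Fall back to the old
    name-based search for manifests that predate the field.
    """
    var = entry.get("config_variable")
    if var and var in module_globals:
        return module_globals[var]
    # Backward-compatible fallback: match objects by their own name attribute.
    for obj in module_globals.values():
        if getattr(obj, "name", None) == entry.get("name"):
            return obj
    return None
```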
deanq and others added 2 commits January 14, 2026 02:02
Changes:
- Replace MD5 with SHA-256 for config hash computation (security best practice)
- Add error callback to background provisioning task for proper exception handling
- Update tests to expect SHA-256 hash length (64 chars instead of 32)

Addresses Copilot review comments:
- mothership_provisioner.py:113 - Use SHA-256 instead of cryptographically broken MD5
- lb_handler_generator.py:81 - Track background task and add error callback
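
Both fixes can be sketched together: SHA-256 for the config hash, and a done-callback so a fire-and-forget provisioning task cannot swallow its exception. Names below are illustrative, not the exact implementation:

```python
import asyncio
import hashlib
import json

def compute_config_hash(config: dict) -> str:
    # SHA-256 yields 64 hex chars (MD5 gave 32); sorted keys keep it stable.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def on_provisioning_done(task: asyncio.Task) -> None:
    # Without a done-callback, exceptions in background tasks vanish silently.
    if not task.cancelled() and task.exception() is not None:
        print(f"provisioning failed: {task.exception()!r}")

async def lifespan_startup(provision_coro) -> asyncio.Task:
    # Start provisioning without blocking cold start; keep a reference to
    # the task so it is not garbage-collected mid-flight.
    task = asyncio.create_task(provision_coro)
    task.add_done_callback(on_provisioning_done)
    return task
```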
```python
try:
    client = await self._get_client()
    response = await client.get(
        f"{self.base_url}/api/v1/flash/manifests/{mothership_id}",
```
Contributor


does this still need to get updated to use the graphql endpoints?

Member Author


Ah yes. What is the correct way?

…hip-deploys-manifest

# Conflicts:
#	tests/integration/test_lb_remote_execution.py
@deanq deanq merged commit 14effd4 into main Jan 14, 2026
7 checks passed
@deanq deanq deleted the deanq/ae-1660-mothership-deploys-manifest branch January 14, 2026 22:32